Abstract

We present MikeTalk, a text-to-audiovisual speech synthesizer which converts input text into an audiovisual
speech stream. MikeTalk is built using visemes, which are a small set of images spanning a large range of mouth 
shapes. The visemes are acquired from a recorded visual corpus of a human subject which is specifically designed 
to elicit one instantiation of each viseme. Using optical flow methods, correspondence from every viseme to every 
other viseme is computed automatically. By morphing along this correspondence, a smooth transition between 
viseme images may be generated. A complete visual utterance is constructed by concatenating viseme transitions. 
Finally, phoneme and timing information extracted from a text-to-speech synthesizer is exploited to determine which 
viseme transitions to use, and the rate at which the morphing process should occur. In this manner, we are able to 
synchronize the visual speech stream with the audio speech stream, and hence give the impression of a photorealistic 
talking face. 
1 Introduction



The goal of the work described in this paper is to 
develop a text-to-audiovisual speech synthesizer called 
MikeTalk. MikeTalk is similar to a standard text-to- 
speech synthesizer in that it converts text into an audio 
speech stream. However, MikeTalk also produces an 
accompanying visual stream composed of a talking face 
enunciating that text. An overview of our system is 
shown in Figure 1. 

Text-to-visual (TTVS) speech synthesis systems have
attracted increasing interest in recent years, driven by
their possible deployment as visual desktop agents,
digital actors, and virtual avatars. They may also have
uses in special effects and very low bitrate coding
schemes (MPEG4), and would be of interest to
psychologists who wish to study visual speech
production and perception.

In this work, we are particularly interested in build- 
ing a TTVS system where the facial animation is video- 
realistic: that is, we desire our talking facial model 
to look as much as possible as if it were a videocam- 
era recording of a human subject, and not that of a 
cartoon-like human character.

In addition, we choose to focus our efforts on the 
issues related to the synthesis of the visual speech 
stream, and not on audio synthesis. For the task of 
converting text to audio, we have incorporated into 
our work the Festival speech synthesis system, which 
was developed by Alan Black, Paul Taylor, and colleagues
at the University of Edinburgh [6]. Festival is
freely downloadable for non-commercial purposes, and 
is written in a modular and extensible fashion, which 
allows us to experiment with various facial animation 
algorithms. 

The Festival TTS system, as with most speech syn- 
thesizers, divides the problem of converting text to 
speech into two sub-tasks, shown in pink in Figure 1: 
first, a natural language processing (NLP) unit con- 
verts the input text into a set of output streams that 
contain relevant phonetic, timing, and other intona- 
tional parameters. Second, an audio signal processing 
unit converts the NLP output streams into an audio 
stream in which the input text is enunciated. 

Within this framework, our goal in this work, as
depicted in red in Figure 1, is two-fold: first, to
develop a visual speech module that takes as input the
phonetic and timing output streams generated by
Festival's NLP unit, and produces as output a visual
speech stream of a face enunciating the input text;
second, to develop a lip-sync module that synchronizes
the playback of the audio and visual streams.



Figure 1. Overview of the MikeTalk TTVS system.



We discuss the facial animation module in Sections 
2 through 6, and the lip-sync module in Section 7. 

2 Background and Motivation 

The main research issue underlying the construction 
of a TTVS visual stream is the nature of the facial 
model to use. One approach is to model the face us- 
ing traditional 3D modeling methods. Parke [22] was 
one of the earliest to adopt such an approach by creat- 
ing a polygonal facial model. The face was animated 
by interpolating the location of various points on the 
polygonal grid. Parke's software and topology were 
subsequently given new speech and expression control 
software by Pearce, Wyvill, Wyvill, & Hill [23]. With 
this software, the user could type a string of phonemes 
that were then converted to control parameters which 
changed over time to produce the animation sequence. 
Each phoneme was represented in a table by 11 control 
parameters, and the system made a transition between 
two phonemes by interpolating between the control
parameters. Recent work on TTVS systems that is based
on Parke's models includes the work of Cohen &
Massaro [10] and LeGoff & Benoit [16].

To increase the visual realism of the underlying fa- 
cial model, the facial geometry is frequently scanned in 
using three-dimensional or laser scanners such as those 
manufactured by Cyberware. Additionally, a texture- 
map of the face extracted by the Cyberware scanner 
may be mapped onto the three-dimensional geometry
[15]. More advanced dynamic, muscle-based animation 
mechanisms were demonstrated by Waters [26]. 



Despite these improvements, the generated facial an- 
imations still lack video realism. One Cyberware tex- 
ture scan alone does not suffice to capture the complex 
time-varying appearance of a human face, and usually
is not able to capture the 3D structure of human hair. 
Furthermore, overlaying the texture scan on top of a 
3D polygonal or muscle-based model exposes the de- 
ficiencies of these models in terms of their ability to 
animate human motion. 

An alternative approach is to model the talking face 
using image-based techniques, where the talking facial 
model is constructed using a collection of images cap- 
tured of the human subject. These methods have the 
potential of achieving very high levels of videorealism, 
and are inspired by the recent success of similar sample- 
based methods for speech synthesis [19]. 

Bregler, Covell, et al. [7] describe such an image-based
approach in which the talking facial model is
composed of a set of audiovisual sequences extracted 
from a larger audiovisual corpus. Each one of these 
short sequences is a triphone segment, and a large 
database with all the acquired triphones is built. A new 
audiovisual sentence is constructed by concatenating 
the appropriate triphone sequences from the database 
together. To handle all the possible triphone contexts, 
however, the system requires a library with tens and 
possibly hundreds of thousands of images, which seems 
to be an overly-redundant sampling of human lip con- 
figurations. 

Cosatto and Graf [11] describe an approach which 
attempts to reduce this redundancy by parameterizing 
the space of lip positions. The imposed dimensions of 
this lip space are lip width, position of the upper lip, 
and position of the lower lip. This 3-dimensional lip 
space is then populated by using the images from the 
recorded corpus. Synthesis is performed by traversing 
trajectories in this imposed lip space. The trajectories 
are created using Cohen-Massaro's coarticulation rules 
[10]. If the lip space is not populated densely, the an- 
imations produced may be jerky. The authors use a 
pixelwise blend to smooth the transitions between the 
lip images, but this can produce undesirable ghosting 
effects. 

An even simpler image-based lip representation was 
used by Scott, Kagels, et al. [24] [27]. Their facial 
model is composed of a set of 40-50 visemes, which are 
the visual manifestation of phonemes. To animate the 
face, a 2D morphing algorithm is developed which is 
capable of transitioning smoothly between the various 
mouth shapes. While this produces smooth transitions 
between the visemes, the morphing algorithm itself re- 
quires considerable user intervention, making the pro- 
cess tedious and complicated. 




Figure 2. The MikeTalk facial model. 



Our work explores further the use of this viseme- 
morphing representation for synthesis of human visual 
speech. Instead of using a manual morphing method, 
however, we employ a method developed by Beymer,
Shashua, and Poggio [5]. This morphing algorithm requires
little or no user intervention, and was shown to be
capable of modeling rigid facial transformations such as
pose changes, as well as non-rigid transformations such 
as smiles. 

3 Our Facial Model 

Our facial modelling approach may best be summa- 
rized as an image-based, morphing method, and is close 
in spirit to the work of [5] [24] [27]. We summarize the
three main aspects of our facial model as follows: 

Corpus and Viseme Acquisition: First, a visual
corpus of a subject enunciating a set of key words
is recorded. Each key word is chosen so that it
visually instantiates one American English phoneme.
Because there are 40-50 American English phonemes
[20], the subject is asked to enunciate 40-50 words.
A single image for each phoneme is subsequently
identified and manually extracted from the corpus
sequence. In this work, we use the term viseme to
denote the lip image extracted for each phoneme. A
few of the viseme images are depicted in Figure 2.

Viseme Morph Transformation: Secondly, we
construct, in a manner described in more detail
below, a morph transformation from each viseme
image to every other viseme image. This
transformation allows us to smoothly and realistically
transition between any two visemes, creating
intermediate lip shape images between the two
endpoints. For $N$ visemes in our final viseme set, we
define $N^2$ such transformations. The arrows between
the viseme images in Figure 2 are a figurative
depiction of these transformations.

Concatenation: Finally, to construct a novel visual
utterance, we concatenate viseme morphs. For ex- 
ample, in terms of Figure 2, the utterance for the 
word one is constructed by morphing from \w\ 
viseme to the \uh\ viseme, followed by a morph 
from the \uh\ viseme to the \n\ viseme. For any 
input text, we determine the appropriate sequence 
of viseme morphs to make, as well as the rate of 
the transformations by utilizing the output of the 
natural language processing unit of the Festival 
TTS system. 
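
As an illustration of the concatenation step, the following minimal sketch turns a viseme sequence into the ordered list of viseme morphs that would be played back to back. The function name `viseme_transitions` and the plain string labels are assumptions made for this example only; they are not part of MikeTalk or Festival.

```python
def viseme_transitions(visemes):
    """Return the consecutive (start, end) viseme pairs for a viseme sequence.

    Each pair names one morph; playing the morphs in order yields the
    visual utterance.
    """
    return list(zip(visemes, visemes[1:]))


# The word "one" from the text: \w\ -> \uh\ -> \n\
transitions = viseme_transitions(["w", "uh", "n"])
print(transitions)
```

For the word one, this yields the \w-uh\ morph followed by the \uh-n\ morph, matching the example in the text.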

In a graph-theoretic sense, our facial model may be 
viewed as an N-node clique, where each node repre- 
sents one viseme, and the directed edges between nodes 
represent the $N^2$ viseme transformations. From an
animator's perspective, the visemes in our model represent
keyframes, and our transformations represent a method 
of interpolating between them. 

In the following sections, we describe the various 
aspects of our approach in detail. 

4 Corpus and Viseme Data Acquisition 

The basic underlying assumption of our facial syn- 
thesis approach is that the complete set of mouth 
shapes associated with human speech may be reason- 
ably spanned by a finite set of visemes. The term 
viseme itself was coined initially by Fisher [12] as an 
amalgamation of the words "visual" and "phoneme".
To date, there has been no precise definition for the 
term, but in general it has come to refer to a speech 
segment that is visually contrastive from another. In 
this work, we define a viseme to be a static lip shape 
image that is visually contrastive from another. 

Given the assumption that visual speech is spanned 
by a set of visemes, we would like to design a particular 
visual corpus which elicits one instantiation for each 
viseme. One possible strategy to adopt is to assume a 
one-to-one mapping between the set of phonemes and 
the set of visemes, and design the corpus so that there 
is at least one word uttered which instantiates each 
phoneme. 



This one-to-one strategy is a reasonable approach 
in light of the fact that we plan on using the Festival 
TTS system to produce the audiovisual sequence. In 
doing so, Festival's NLP unit will produce a stream 
of phonemes corresponding to the input text. Con- 
sequently, there is a need to map from the set of 
phonemes used by the TTS to a set of visemes so as 
to produce the visual stream. A one-to-one mapping 
between phonemes and visemes thus ensures that a 
unique viseme image is associated with each phoneme 
label. Since most speech textbooks and dictionaries 
contain a list of phonemes and example words which 
instantiate them, the corpus may thus be designed to 
contain those example words. 

Our recorded corpus is shown in Figure 3. The 
example words uttered are obtained from [20], and 
are generally categorized into example words which
instantiate consonantal, monophthong vocalic, and
diphthong vocalic phonemes. After the whole corpus is
recorded and digitized, one lip image is extracted for
each phoneme as an instance of its viseme. In general,
the viseme image extracted was chosen as the image
occurring at the point where the lips were judged to be
at their extremal position for that sound.

It should be noted that diphthongs are treated in 
a special manner in this work. Since diphthongs are 
vocalic phonemes which involve a quick transition be- 
tween two underlying vowel nuclei, it was decided that 
two viseme images were necessary to model that diph- 
thong visually: one to represent the first vowel nucleus, 
and the other to represent the second. Consequently, 
we extract two images for every diphthong from the 
recorded corpus. 

The one-to-one mapping strategy thus leads to the 
extraction of 52 viseme images in all: 24 represent- 
ing the consonants, 12 representing the monophthongs, 
and 16 representing the diphthongs. 

Since a large number of the extracted visemes looked 
similar, it was decided to further reduce the viseme 
set by grouping them together. This was done in a 
subjective manner, by comparing the viseme images 
visually to assess their similarity. This is in keep- 
ing with the current viseme literature, which indicates 
that the mapping between phonemes and visemes is, 
in fact, many-to-one: there are many phonemes which 
look alike visually, and hence they fall into the same 
visemic category. This is particularly true, for exam- 
ple, in cases where two sounds are identical in man- 
ner and place of articulation, but differ only in voicing 
characteristics. For example, \b\ and \p\ are two bil- 
abial stops which differ only in the fact that the former 
is voiced while the latter is voiceless. This difference, 
however, does not manifest itself visually, and hence
the two phonemes should be placed in the same visemic
category. In grouping the visemes subjectively, the au- 
thors were guided by the conclusions in [21] for the 
case of consonantal visemes, and in [18] for the case of 
vocalic visemes. 
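
The many-to-one grouping can be pictured as a lookup table from phoneme labels to viseme classes. The sketch below is illustrative only: the class names are hypothetical, and only two consonant groups that are legible in Figure 4 are included, so it is not the complete mapping used in this work.

```python
# Hypothetical viseme class names; group membership follows two of the
# consonant groups that are legible in Figure 4.
VISEME_GROUPS = {
    "p_b_m": ["p", "b", "m"],
    "t_d_s_z_th_dh": ["t", "d", "s", "z", "th", "dh"],
}

# Invert the groups into a many-to-one phoneme -> viseme lookup.
PHONEME_TO_VISEME = {
    phoneme: viseme
    for viseme, phonemes in VISEME_GROUPS.items()
    for phoneme in phonemes
}


def to_visemes(phonemes):
    """Map a phoneme sequence to viseme class labels."""
    return [PHONEME_TO_VISEME[p] for p in phonemes]


# \b\ and \p\ differ only in voicing, so they share one viseme class.
print(to_visemes(["b", "p", "t"]))
```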

It is also important to point out that the
map from phonemes to visemes is also one-to-many: 
the same phoneme can have many different visual 
forms. This phenomenon is termed coarticulation, and 
it occurs because the neighboring phonemic context in 
which a sound is uttered influences the lip shape for 
that sound. For example, the viseme associated with 
\t\ differs depending on whether the speaker is ut- 
tering the word two or the word tea. In the former 
case, the \t\ viseme assumes a rounded shape in an- 
ticipation of the upcoming \uu\ sound, while the lat- 
ter assumes a more spread shape in anticipation of the 
upcoming \ii\ sound. To date, and largely due to 
the large number of influencing factors, the nature and 
scope of coarticulation remains an open research prob- 
lem. The reader is referred to [10] for an in-depth dis- 
cussion on the theories behind coarticulation, and to 
[21] for a study on consonantal perception in various 
vocalic contexts. At the present stage of our work, we 
have decided for the sake of simplicity to ignore
coarticulation effects.

The final reduced set of visemes is shown in Figures
4 and 5. Note that while our figures display only the
region around the mouth, our viseme imagery captures
the entire face.

In all, there are 16 final visemes. Six visemes repre- 
sent the 24 consonantal phonemes. Seven visemes rep- 
resent the 12 monophthong phonemes. In the case of 
diphthongs, it was found that all vowel nuclei could be 
represented by corresponding monophthong visemes. 
The only exception to this occurred in the case of two 
nuclei: the second nucleus of the \au\ diphthong,
which we call the \w-au\ viseme, and the first
nucleus of the \ou\ diphthong, which we call the \o-ou\
viseme. Finally, one extra viseme was included to rep- 
resent silence, which we call \#\. 

5 Viseme Morphing 

In constructing a visual speech stream, it is not suffi- 
cient to simply display the viseme images in sequence. 
Doing so would create the disturbing illusion of very 
abrupt mouth movement, since the viseme images differ 
significantly from each other in shape. Consequently, a 
mechanism of transitioning from each viseme image to 
every other viseme image is needed, and this transition 
must be smooth and realistic. 

In this work, a morphing technique was adopted to
create this transition. Morphing was first popularized
by Beier & Neely [3] in the context of generating
transitions between different faces for Michael Jackson's
Black or White music video. Given two images $I_0$ and
$I_1$, morphing generates intermediate images $I_\alpha$,
where $\alpha$ is a parameter ranging from 0 to 1. These
intermediate images are generated by warping $I_0$ towards
$I_1$, warping $I_1$ towards $I_0$, and cross-dissolving the
warped images to produce the final desired image.
Intuitively, warping may be viewed as an interpolation in
shape, while cross-dissolving may be viewed as an
interpolation in texture.

Figure 3. The recorded visual corpus. The underlined
portion of each example word identifies the target
phoneme being recorded. To the left of each example
word is the phonemic transcription label being used.

In the following sections, we discuss each of the 
above steps in detail, and describe their effect on 
viseme images in particular. 





Figure 4. The 6 consonant visemes. Legible panel labels include /p, b, m/, /f, v/, and /t, d, s, z, th, dh/.



5.1 Correspondence 

As a first step, all morphing methods require the
specification of correspondence maps $C_0 : I_0 \Rightarrow I_1$
and $C_1 : I_1 \Rightarrow I_0$ relating the images $I_0$ and
$I_1$ to each other. These maps serve to ensure that the
subsequent warping process preserves the desired
correspondence between the geometric attributes of the
objects to be morphed.

In this work, we choose to represent the
correspondence maps using relative displacement vectors:

$$C_0(p_0) = \{ d_x^{0 \to 1}(p_0),\ d_y^{0 \to 1}(p_0) \} \tag{1}$$

$$C_1(p_1) = \{ d_x^{1 \to 0}(p_1),\ d_y^{1 \to 0}(p_1) \} \tag{2}$$

A pixel in image $I_0$ at position $p_0 = (x, y)$ corresponds
to a pixel in image $I_1$ at position
$(x + d_x^{0 \to 1}(x, y),\ y + d_y^{0 \to 1}(x, y))$. Likewise, a pixel
in image $I_1$ at position $p_1 = (x, y)$ corresponds to a
pixel in image $I_0$ at position
$(x + d_x^{1 \to 0}(x, y),\ y + d_y^{1 \to 0}(x, y))$. As discussed in
[25], two maps are usually required because one map
by itself may not be one-to-one.

The specification of the correspondence maps $C_0$
and $C_1$ between the images is typically the hardest
part of the morph. Previous methods [3] [24] [14] 
have adopted feature-based approaches, in which a set 
of high-level shape features common to both images 
is specified. The correspondences for the rest of the 
points are determined using interpolation. 

When it is done by hand, however, this feature
specification process can become quite tedious and
complicated, especially in cases when a large amount of
imagery is involved. In addition, the process of specifying
the feature regions usually requires choosing among a
large number of fairly arbitrary geometric primitives
such as points, line segments, arcs, circles, and meshes.
However, in our case the images to be morphed
belong to one single object that is undergoing motion: a
talking face. The problem of specifying correspondence
between two images thus reduces to the problem of
estimating the motion field of the underlying moving
object. This observation, made in [5] and [9], is extremely
significant because it allows us to make use of a large
number of automatic motion estimation algorithms for
the purpose of computing the desired correspondence
between two images. In this work, we make use of
optical flow algorithms to estimate this motion.

Figure 5. The 7 monophthong visemes, 2 diphthong
visemes, and the silence viseme.

5.2 Optical Flow 

Optical flow [13] was originally formulated in the
context of measuring the apparent motion of objects
in images. This apparent motion is captured as a
two-dimensional array of displacement vectors, in exactly
the same format shown in Equations 1 and 2. Given two
images $I_0$ and $I_1$, computing optical flow with $I_0$ as
reference image produces a correspondence map $C_0$, while
computing optical flow with $I_1$ as reference produces a
correspondence map $C_1$.

Optical flow is thus of clear importance because it
allows for the automatic determination of
correspondence maps. In addition, since each pixel is
effectively a feature point, optical flow allows us to
bypass the need for choosing any of the aforementioned
geometric feature primitives. In this sense, optical flow
is said to produce dense pixel correspondence.

There is currently a vast literature on this subject
(see for example [2] for a recent review), and several
different methods for computing flow have been proposed
and implemented. In this work, we utilize the
coarse-to-fine, gradient-based optical flow algorithms
developed in [4]. These algorithms compute the desired
flow displacements using the spatial and temporal im- 
age derivatives. In addition, they embed the flow esti- 
mation procedure in a multiscale pyramidal framework 
[8], where initial displacement estimates are obtained 
at coarse resolutions, and then propagated to higher 
resolution levels of the pyramid. Given the size of our 
viseme imagery, we have found that these methods are 
capable of estimating displacements on the order of 5 
pixels between two images. 

For even larger displacements between visemes, we 
have found that a flow concatenation procedure is ex- 
tremely useful in estimating correspondence. Since the 
original visual corpus is digitized at 30 fps, there are 
many intermediate frames that lie between the chosen 
viseme images. The pixel motions between these con- 
secutive frames are small, and hence the gradient-based 
optical flow method is able to estimate the displace- 
ments. Consequently, we compute a series of consecu- 
tive optical flow vectors between each intermediate im- 
age and its successor, and concatenate them all into one 
large flow vector that defines the global transformation 
between the chosen visemes. Details of our flow con- 
catenation procedure itself are found in the appendix. 
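
The idea behind flow concatenation can be sketched as follows. This is a simplified illustration and not the procedure from the appendix: it assumes a flow field is stored as a nested list of `(dx, dy)` tuples, looks up the second flow at the nearest pixel, and clamps positions to the image bounds. The names `concatenate_flows` and `concatenate_series` are hypothetical.

```python
from functools import reduce


def concatenate_flows(flow_ab, flow_bc):
    """Compose flow A->B with flow B->C into a single flow A->C.

    flow[y][x] is a (dx, dy) displacement. Assumption of this sketch:
    the second flow is sampled at the nearest pixel, clamped to the image.
    """
    height, width = len(flow_ab), len(flow_ab[0])
    combined = []
    for y in range(height):
        row = []
        for x in range(width):
            dx, dy = flow_ab[y][x]
            # Where this pixel lands in frame B.
            xb = min(max(int(round(x + dx)), 0), width - 1)
            yb = min(max(int(round(y + dy)), 0), height - 1)
            dx2, dy2 = flow_bc[yb][xb]
            row.append((dx + dx2, dy + dy2))
        combined.append(row)
    return combined


def concatenate_series(flows):
    """Fold a list of consecutive frame-to-frame flows into one flow."""
    return reduce(concatenate_flows, flows)


# Three consecutive frames, each shifted one pixel to the right.
step = [[(1, 0)] * 5]
total = concatenate_series([step, step, step])
print(total[0][0])
```

Because every composition step rounds and resamples, errors accumulate as more flows are chained, which is one way to see why concatenation over very long frame ranges degrades.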

It is not practical, however, to compute
concatenated optical flow between viseme images that are very
far apart in the recorded visual corpus. The repeated
concatenation that would be involved across the
hundreds of intermediate frames leads to a considerably
degraded final flow. Consequently, we have found that
the best procedure for obtaining good correspondences
between visemes is actually a mixture of both direct and
concatenated flow computations: typically, an
intermediate frame is chosen that is simultaneously similar in
shape to the chosen starting viseme, and also close in
distance to the chosen ending viseme. Direct optical
flow is then computed between the starting viseme and
this intermediate frame, and concatenated optical flow
is computed from the intermediate up to the ending
viseme. The final flow from the starting viseme to the
ending viseme is then itself a concatenation of both of
these direct and concatenated subflows.

Figure 6. FORWARD WARP algorithm, which warps $I$
forward along flow vectors $\alpha d_x$ and $\alpha d_y$ to
produce $I^{warped}$.

5.3 Forward Warping 

Given two viseme images $I_0$ and $I_1$, the first step of
our morphing algorithm is to compute the
correspondence map $C_0(p_0) = \{ d_x^{0 \to 1}(p_0),\ d_y^{0 \to 1}(p_0) \}$ as
discussed in the previous section, and then to forward
warp $I_0$ along that correspondence.

Forward warping may be viewed as "pushing" the
pixels of $I_0$ along the computed flow vectors. By
scaling the flow vectors uniformly by the parameter $\alpha$
between 0 and 1, one can produce a series of warped
intermediate images $I_0^{warped}(\alpha)$ which approximate the
transformation between visemes $I_0$ and $I_1$. Formally,
the forward warp $W_0$ and the warped image $I_0^{warped}$
are computed as

$$W_0(p_0, \alpha) = p_0 + \alpha\, C_0(p_0) \tag{3}$$

$$I_0^{warped}(W_0(p_0, \alpha)) = I_0(p_0) \tag{4}$$

A procedural version of our forward warp is shown in
Figure 6.
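
Since the listing in Figure 6 is not reproduced here, the following is a minimal sketch of a forward warp in the spirit of Equations 3 and 4, not the original listing. It assumes grayscale images stored as nested lists, a flow field of `(dx, dy)` tuples, rounding to the nearest destination pixel, and `None` standing in for the reserved hole color; all names are hypothetical.

```python
HOLE = None  # stands in for the reserved "hole" color


def forward_warp(image, flow, alpha):
    """Push each source pixel along alpha times its flow vector.

    Destination pixels that receive no source pixel remain HOLE.
    Assumption of this sketch: targets are rounded to the nearest pixel
    and pixels pushed outside the image are dropped.
    """
    height, width = len(image), len(image[0])
    warped = [[HOLE] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            dx, dy = flow[y][x]
            xt = int(round(x + alpha * dx))
            yt = int(round(y + alpha * dy))
            if 0 <= xt < width and 0 <= yt < height:
                warped[yt][xt] = image[y][x]
    return warped


image = [[10, 20, 30, 40]]
flow = [[(2, 0)] * 4]  # every pixel moves two pixels to the right
for alpha in (0.0, 0.5, 1.0):
    print(alpha, forward_warp(image, flow, alpha))
```

In this toy example the left side of the row is vacated as alpha grows, which is the same mechanism that produces holes where the mouth region expands.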

Several such intermediate warps are shown in
Figure 7a, where $I_0$ is the \m\ viseme and $I_1$ is the \aa\
viseme. The black holes which appear in the
intermediate images occur in cases where a destination pixel was
not filled in with any source pixel value. The main rea- 
son for this is that the computed optical flow displace- 
ments exhibit nonzero divergence, particularly around 
the region where the mouth is expanding. 

In symmetric fashion, it is also possible to forward
warp $I_1$ towards $I_0$. First, the reverse correspondence
map $C_1$ is estimated by computing optical flow from $I_1$
towards $I_0$. The forward warp $W_1$ and warped
intermediate image $I_1^{warped}(\beta)$ are then generated
analogously to Equations 3 and 4:

$$W_1(p_1, \beta) = p_1 + \beta\, C_1(p_1) \tag{5}$$

$$I_1^{warped}(W_1(p_1, \beta)) = I_1(p_1) \tag{6}$$

Note that in order to align the two forward warps with
respect to each other, $\beta$ is set to $1 - \alpha$. In this
manner, as $\alpha$ moves from 0 to 1, the warped intermediates
$I_1^{warped}(1 - \alpha)$ start out with the fully warped versions
and move towards the viseme $I_1$. Several intermediate
images of this reverse forward warp are shown in
Figure 7b.

Figure 7. A) Forward warping viseme $I_0$ (first image) towards $I_1$. B) Forward warping viseme $I_1$ (last
image) towards $I_0$. C) Morphing the images $I_0$ and $I_1$ together. D) The same morphed images as
in C), after hole-filling and median-filtering. Note that our morphing algorithm operates on the entire
facial image, although we only show the region around the mouth for clarity.

5.4 Morphing 

Because forward warping can only move pixels 
around, it cannot model the appearance of new pixel 
texture. As is evident from the sequence in Figure 7a,
a forward warp of viseme $I_0$ along the flow vectors of $C_0$
can never produce a final image that looks like viseme
$I_1$, since viseme $I_1$ itself contains a large amount of
novel texture from the inside of the mouth.

Morphing overcomes this "novel pixel texture"
problem by combining the texture found in both forward
warps. This combination is performed by scaling
the warped intermediate images with respective
cross-dissolve or blending parameters, and then adding to
produce the final morphed image $I^{morph}(\alpha)$:

$$I^{morph}(p, \alpha) = (1 - \alpha)\, I_0^{warped}(p, \alpha) + \alpha\, I_1^{warped}(p, (1 - \alpha)) \tag{7}$$



By interpolating the blending parameters the morph 
"fades out" the warped versions of the starting viseme 
and "fades in" the warped versions of the ending 
viseme. The blending process thus allows the two 
warps to be effectively combined, and the "new" pixels 
of the second viseme to become involved in the viseme 
transition itself. 

Due to the presence of holes in the warped
intermediates, we adopt a slightly more sophisticated
blending mechanism: Whenever one forward warp predicts
a pixel value in an intermediate image while the other
leaves a hole, we set the final pixel value in the morphed
image to be the same as that predicted by the single
warp, without any blending. Whenever both forward
warps predict pixel values in an intermediate image, we
resort to the standard blending approach of Equation
7. Holes are detected in the images by pre-filling the
destination images with a special reserved color prior
to warping. Our morphed intermediate image is thus
synthesized as:

$$I^{morph}(p, \alpha) = \begin{cases} I_0^{warped}(p, \alpha) & \text{if } I_1^{warped}(p, (1 - \alpha)) \text{ is a hole} \\ I_1^{warped}(p, (1 - \alpha)) & \text{if } I_0^{warped}(p, \alpha) \text{ is a hole} \\ (1 - \alpha)\, I_0^{warped}(p, \alpha) + \alpha\, I_1^{warped}(p, (1 - \alpha)) & \text{otherwise} \end{cases} \tag{8}$$

Figure 7c depicts several morphed images constructed 
in this manner. Note that our morphing algorithm 
operates on the entire facial image, although we only 
show the region around the mouth for clarity. 
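
The hole-aware blend described above can be sketched per pixel as follows. This is an illustrative reading of the blending rule, assuming grayscale images as nested lists and `None` as the hole marker; `blend_pixel` and `morph` are hypothetical names. A pixel that is a hole in both warps stays a hole and is left for the later hole-filling step.

```python
HOLE = None  # stands in for the reserved "hole" color


def blend_pixel(p0, p1, alpha):
    """Blend one pixel from the two warps, falling back to a single warp at holes."""
    if p0 is HOLE and p1 is HOLE:
        return HOLE  # neither warp predicts a value
    if p1 is HOLE:
        return p0
    if p0 is HOLE:
        return p1
    return (1.0 - alpha) * p0 + alpha * p1


def morph(warped0, warped1, alpha):
    """Combine I0 warped by alpha with I1 warped by (1 - alpha)."""
    return [
        [blend_pixel(a, b, alpha) for a, b in zip(row0, row1)]
        for row0, row1 in zip(warped0, warped1)
    ]


warped0 = [[100, HOLE, 50, HOLE]]
warped1 = [[200, 80, HOLE, HOLE]]
print(morph(warped0, warped1, 0.25))
```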

As may be seen in the images of Figure 7c, there are 
locations for which neither warp predicts a pixel value, 
leaving a set of visible holes in the morphed images. 
To remedy this, a hole-filling algorithm proposed in [9] 
was adopted. The algorithm traverses the destination 
image in scanline order and fills in the holes by interpo- 
lating linearly between their non-hole endpoints. This 
approach works reasonably well whenever the holes are 
small in size, which is the case here. 
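
A scanline hole-filling pass of this kind might look like the sketch below. It is not the implementation from [9]: it assumes one grayscale scanline as a list with `None` marking holes, and it makes the additional assumption that a hole run touching the image border is filled with the single available endpoint value.

```python
HOLE = None  # stands in for the reserved "hole" color


def fill_holes_scanline(row):
    """Fill each run of holes by linear interpolation between its endpoints."""
    filled = list(row)
    n = len(filled)
    x = 0
    while x < n:
        if filled[x] is not HOLE:
            x += 1
            continue
        start = x
        while x < n and filled[x] is HOLE:
            x += 1
        has_left, has_right = start > 0, x < n
        if not has_left and not has_right:
            break  # the whole scanline is a hole; nothing to interpolate
        # Assumption: at the border, reuse the one endpoint that exists.
        left = filled[start - 1] if has_left else filled[x]
        right = filled[x] if has_right else filled[start - 1]
        span = x - start + 1
        for k in range(start, x):
            t = (k - start + 1) / span
            filled[k] = (1 - t) * left + t * right
    return filled


def fill_holes(image):
    """Apply the scanline fill to every row of the image."""
    return [fill_holes_scanline(row) for row in image]


print(fill_holes([[10, HOLE, HOLE, 40]]))
```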

In addition, the morphed image occasionally ex- 
hibits "salt-and-pepper"-type noise. This occurs when- 
ever there is a slight mismatch in the brightness val- 
ues of neighboring pixels predicted by different viseme 
endpoints. To remove this, we apply a 3-by-3 median
filter [17] to the hole-filled morphed image. Figure
7d shows the same set of morphed intermediates as in 
Figure 7c, but with the holes filled and median-filtered. 
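
For reference, a 3-by-3 median filter can be sketched in a few lines. This version assumes a hole-free grayscale image as nested lists and, as an assumption of the sketch, simply truncates the window at the image border; `median_low` is used so that every output value is one of the input pixel values.

```python
from statistics import median_low


def median_filter_3x3(image):
    """Replace each pixel by the median of its 3-by-3 neighborhood."""
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(height):
        for x in range(width):
            # Assumption: the window is truncated at the image border.
            window = [
                image[j][i]
                for j in range(max(y - 1, 0), min(y + 2, height))
                for i in range(max(x - 1, 0), min(x + 2, width))
            ]
            out[y][x] = median_low(window)
    return out


# A single bright outlier in a flat region is removed.
noisy = [[10, 10, 10], [10, 255, 10], [10, 10, 10]]
print(median_filter_3x3(noisy))
```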

Overall, we have found that the above morphing 
approach produces remarkably realistic transitions be- 
tween a wide variety of viseme imagery, including the 
typically "hard" morph transitions between open and 
closed mouths shown in Figure 7. It should be noted 
that our algorithm is essentially linear, in the sense 
that the warping and blending functions depend
linearly on the parameter $\alpha$. As in [5] [14], it is
possible to use more complicated, nonlinear warping and
blending functions. We discuss the utility of using such 
functions briefly in the next section. 

6 Morph Concatenation 

To construct a visual stream in which a word or a 
sentence is uttered, we simply concatenate the appro- 
priate viseme morphs together. For example, the word 
one, which has a phonetic transcription of \w-uh-n\, 
is composed of the two viseme morphs \w-uh\ and 
\uh-n\ put together and played seamlessly one right
after the other. It should be noted that while this
concatenation method guarantees $G^0$ geometric
continuity, it does not guarantee continuity of speed, velocity,
or acceleration. Future work must address this issue,
since discontinuities in mouth motion are objectionable
to the viewer. More advanced morph transition rates
[14] are required to deal with this issue.

7 Audiovisual Synchronization 

As discussed earlier, we have incorporated the Fes- 
tival TTS system [6] into our work. In a manner that 
is completely analogous to our method for concatenat- 
ing viseme morphs, Festival constructs the final au- 
dio speech stream by concatenating diphones together. 
Diphones are short audio sequences which sample the 
transition from the middle of one phone to the
middle of another phone. Given the presence of about 
40-50 phonemes in English, most diphone-based TTS 
systems record a corpus of about 1600-2500 diphones. 
Shown schematically at the top of figure 8 is an audio
stream for the word one, which is composed of the
diphones \w-uh\ and \uh-n\ concatenated together.

In order to produce a visual speech stream in synchrony
with the audio speech stream, our lip-sync module
first extracts the duration of each diphone $D_i$ as
computed by the audio module. We denote this duration
(in seconds) as $l(D_i)$. Additionally, we compute
the total duration $T$ of the audio stream as
$T = \sum_i l(D_i)$.

Next, the lip-sync module creates an intermediate 
stream, called the viseme transition stream. A viseme 
transition is defined to be the collection of two end- 
point visemes and the optical flow correspondences be- 
tween them. The lip-sync module loads the appropriate 
viseme transitions into the viseme transition stream by 
examining the audio diphones. For example, if $D_i$ is a
\uh-n\ diphone, then the corresponding viseme transition
$V_i$ loaded by the lip-sync module is composed of
the \uh\ viseme, the \n\ viseme, and the optical flow
vectors between them.

Two additional attributes are also computed for
each viseme transition. The first is the duration of each
viseme transition, $l(V_i)$, which is set to be equal to the
duration of the corresponding diphone $l(D_i)$. For reasons
that will become clear shortly, it is also useful to
compute the start index in time of each viseme transition.
We denote this as $s(V_i)$, and compute it as

$$s(V_i) = \begin{cases} 0 & \text{if } i = 0 \\ s(V_{i-1}) + l(V_{i-1}) & \text{otherwise} \end{cases} \qquad (8)$$

Thirdly, the lip-sync module creates the video
stream, which is composed of a sequence of frames
which sample the chosen viseme transitions. Given a
chosen frame rate $F$, the lip-sync module is required to
create $TF$ frames. As shown in figure 8, this naturally
implies that the start index in time of the k'th frame
is

$$s(F_k) = k/F \qquad (9)$$

Given the start index of each viseme transition (Equation 8)
and the start index of each frame (Equation 9),
the lip-sync algorithm determines how to synthesize
each frame $F_k$ by setting the morph parameter $\alpha_k$
for that frame to be

$$\alpha_k = \frac{s(F_k) - s(V_i)}{l(V_i)} \quad \text{if } s(F_k) - s(V_i) < l(V_i) \qquad (10)$$

The morph parameter is thus simply the length of time
elapsed from the start of a viseme transition to the
frame, divided by the entire duration of the viseme
transition itself. The condition on the right hand
side of Equation 10 is there to ensure that the correct
viseme is chosen to synthesize a particular frame.
In terms of figure 8, this condition would ensure that
frames 1, 2, 3, and 4 are synthesized from the \w-uh\
viseme transition, while frames 5, 6, 7, 8 are synthesized
from the \uh-n\ viseme transition.

Figure 8. LIP-SYNC diagram

As a final step, each frame is synthesized using the
morph algorithm discussed in Section 5.4.

We have found that the use of TTS timing and
phonemic information in this manner produces very
good quality lip synchronization between the audio and
the video. However, since the video sampling rate is
constant and independent of mouth displacements, frequent
undersampling of large mouth movement occurs,
which can be very objectionable to the viewer. Accordingly,
we have found it necessary to increase our
sampling rate adaptively based on the optical flow vectors
in each viseme transition. In a scheme similar
to one suggested by [9], we pre-compute the maximal
offset vector for each viseme transition, and use it to
determine if our constant sampling rate undersamples
a viseme transition. If so, then we add more samples to
the viseme transition until the motion rate is brought
down to an acceptable level. Typically, we've found
that oversampling a viseme transition to ensure no
more than 2.5-3.0 pixel displacements between frames
leads to acceptably smooth mouth motion. When a
viseme transition is oversampled, the corresponding audio
diphone is lengthened to ensure that synchrony between
audio and video is maintained.


8 Summary 

In summary, our talking facial model may be viewed 
as a collection of viseme imagery and the set of optical 
flow vectors defining the morph transition paths from 
every viseme to every other viseme. 

We briefly summarize the individual steps involved 
in the construction of our facial model: 

Recording the Visual Corpus: First, a visual cor- 
pus of a subject enunciating a set of key words is 
recorded. An initial one-to-one mapping between 
phonemes and visemes is assumed, and the sub- 
ject is asked to enunciate 40-50 words. One single 
image for each viseme is identified and extracted 
manually from the corpus. The viseme set is then 
subjectively reduced to a final set of 16 visemes. 

Building the Flow Database: Second, we build a
database of optical flow vectors that specify the
morph transition from each viseme image to every
other viseme image. Since there are 16 visemes
in our final viseme set, a total of 256 (16 x 16) optical
flow vectors are computed.

Synthesizing the New Audiovisual Sentence: Finally,
we utilize a text-to-speech system [6] to
convert input text into a string of phonemes, along 
with duration information for each phoneme. Us- 
ing this information, we determine the appropriate 
sequence of viseme transitions to make, as well as 
the rate of the transformations. The final visual 
sequence is composed of a concatenation of the 
viseme transitions, played in synchrony with the 
audio speech signal generated by the TTS system. 

9 Results 

We have synthesized several audiovisual sentences
to test our overall approach for visual
speech synthesis and audio synchronization described
above. Our results may be viewed by
accessing our World Wide Web home page at
http://cuneus.ai.mit.edu:8000/research/miketalk/miketalk.html.
The first author may also
be contacted for a video tape which depicts the results
of this work.

10 Discussion 

On the positive side, our use of actual images as
visemes allows us to achieve a significant level of videorealism
in the final facial model. And since we only
need to sample the visemes themselves, and not all the
possible transition paths between them, the size of
the visual corpus which needs to be recorded is small.
The transitions between the visemes are computed in 
an off-line manner automatically using our optical flow 
techniques. 

The use of concatenated optical flow as a method 
to compute correspondence between visemes automat- 
ically seems to work very well, allowing us to over- 
come the difficulties associated with other correspon- 
dence methods which are manual and very tedious. In 
addition, the representation of a viseme transition as 
an optical flow vector allows us to morph as many in- 
termediate images as necessary to maintain synchrony 
with the audio produced by the TTS. 

Despite these advantages, there is clearly a large 
amount of further work to be done. First, there is 
no coarticulation model included in our facial synthe- 
sis method. This has the effect of producing visual 
speech that looks overly articulated. There is a clear 
need to use more sophisticated techniques to learn the 
dynamics of facial mouth motion. 

Also, there is a clear need to incorporate into our 
work nonverbal mechanisms in visual speech commu- 
nication such as eye blinks, eye gaze changes, eye- 
brow movements, and head nods. These communica- 
tion mechanisms would serve to make the talking facial 
model more lifelike. 

Since our method is image-based, we are constrained 
by the pose of the face in the imagery captured. Fu- 
ture work must address the need to synthesize the talk- 
ing facial model in different poses, and various recent 
image-based warping and morphing methods may be 
explored in this regard [1] [25].

A Appendix: Flow Concatenation

We briefly discuss the details of the flow concatenation
algorithm mentioned in section 5.2. Given a
series of consecutive images $I_0, I_1, \ldots, I_n$, we would like
to construct the correspondence map $C_{0(n)}$ relating $I_0$
to $I_n$. Direct application of the optical flow algorithm
may fail because the displacements in the images are
too large. Because the images $I_0, I_1, \ldots, I_n$ are the result
of a dense sampling process, the motion between
consecutive frames is small, and hence we can compute
optical flow between the consecutive frames to
yield $C_{01}, C_{12}, \ldots, C_{(n-1)n}$. The goal is to concatenate
$C_{01}, C_{12}, \ldots, C_{(n-1)n}$ together to yield an approximation
to the desired $C_{0(n)}$ map. We can view this problem
as the algebraic problem of adding vector fields.

We focus on the case of the 3 images $I_{i-1}, I_i, I_{i+1}$
and the correspondences $C_{(i-1)i}, C_{i(i+1)}$, since the
concatenation algorithm is simply an iterative application
of this 3-frame base case. Note that it is not correct
to construct $C_{(i-1)(i+1)}(p)$ as the simple addition
of $C_{(i-1)i}(p) + C_{i(i+1)}(p)$ because the two flow fields
are with respect to two different reference images, $I_{i-1}$
and $I_i$. Vector addition needs to be performed with
respect to a common origin.

Our concatenation thus proceeds in two steps: to
place all vector fields in the same reference frame, the
correspondence map $C_{i(i+1)}$ itself is warped backwards
[28] along $C_{(i-1)i}$ to create $C^{\mathrm{warped}}_{i(i+1)}$. Now
$C^{\mathrm{warped}}_{i(i+1)}$ and $C_{(i-1)i}$ are both added to produce
an approximation to the desired concatenated correspondence:

$$C_{(i-1)(i+1)}(p) = C_{(i-1)i}(p) + C^{\mathrm{warped}}_{i(i+1)}(p) \qquad (11)$$

A procedural version of our backward warp is shown
in figure 9. BILINEAR refers to bilinear interpolation of
the 4 pixel values closest to the point (x,y).



    for j = 0 ... height,
        for i = 0 ... width,
            x = i + dx(i,j);
            y = j + dy(i,j);
            I^warped(i,j) = BILINEAR(I, x, y);

Figure 9. BACKWARD WARP algorithm,
which warps I backwards along dx and dy to
produce I^warped
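
For concreteness, the backward warp of figure 9 can be written as a
short runnable Python sketch. Images and flow components are plain
lists of rows, indexed as img[j][i]. The function names are
hypothetical, and clamping sample positions to the image border is an
assumption, since the pseudocode does not say how out-of-range
coordinates are handled.

```python
def bilinear(img, x, y):
    """Interpolate img at real-valued (x, y) from its 4 nearest pixels.

    Coordinates are clamped to the image border (an assumption).
    """
    h, w = len(img), len(img[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = (1 - fx) * img[y0][x0] + fx * img[y0][x1]
    bottom = (1 - fx) * img[y1][x0] + fx * img[y1][x1]
    return (1 - fy) * top + fy * bottom


def backward_warp(img, dx, dy):
    """Each output pixel (i, j) is read from img at (i + dx, j + dy)."""
    h, w = len(img), len(img[0])
    return [
        [bilinear(img, i + dx[j][i], j + dy[j][i]) for i in range(w)]
        for j in range(h)
    ]


row = [[0, 10, 20, 30]]
half_pixel = [[0.5, 0.5, 0.5, 0.5]]
no_motion = [[0.0, 0.0, 0.0, 0.0]]
print(backward_warp(row, half_pixel, no_motion))  # [[5.0, 15.0, 25.0, 30.0]]
```

Because every output pixel pulls its value from the source image, the
result has no holes, unlike a forward warp.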



There are two algorithmic variants for concatenation
of n images, shown below. The first
variant, CONCATENATION-DOWN, iteratively computes
concatenated flows $C_{(n-1)(n)}, C_{(n-2)(n)}, \ldots, C_{0(n)}$ using
the method discussed above. The desired final
concatenated flow is $C_{0(n)}$. The second variant,
CONCATENATION-UP, iteratively computes
concatenated flows $C_{0(1)}, C_{0(2)}, \ldots, C_{0(n)}$. The desired
final concatenated flow is $C_{0(n)}$.



    for i = n-1 downto 0 do,
        if i = n-1 then
            compute C_{i(n)}(p) using optical flow
        else
            compute C_{i(i+1)}(p) using optical flow
            warp C_{(i+1)n}(p) backwards along C_{i(i+1)}(p)
                to produce C^warped_{(i+1)n}(p)
            set C_{i(n)}(p) = C_{i(i+1)}(p) + C^warped_{(i+1)n}(p)

Figure 10. CONCATENATION-DOWN algorithm

    for i = 1 to n do,
        if i = 1 then
            compute C_{0(i)}(p) using optical flow
        else
            compute C_{(i-1)i}(p) using optical flow
            warp C_{(i-1)i}(p) backwards along C_{0(i-1)}(p)
                to produce C^warped_{(i-1)i}(p)
            set C_{0(i)}(p) = C_{0(i-1)}(p) + C^warped_{(i-1)i}(p)

Figure 11. CONCATENATION-UP algorithm
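
The following Python sketch illustrates the concatenation step of
Equation 11, applied in the CONCATENATION-UP order. It assumes the
consecutive flows have already been computed by some optical flow
routine and are supplied as (dx, dy) pairs of row lists. The function
names and the border clamping are assumptions made for illustration.

```python
def sample(field, x, y):
    """Bilinear sample of a 2-D list at (x, y), clamped to the border."""
    h, w = len(field), len(field[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = (1 - fx) * field[y0][x0] + fx * field[y0][x1]
    bottom = (1 - fx) * field[y1][x0] + fx * field[y1][x1]
    return (1 - fy) * top + fy * bottom


def compose(first, second):
    """Equation 11: first(p) plus second warped backwards along first."""
    dx1, dy1 = first
    dx2, dy2 = second
    h, w = len(dx1), len(dx1[0])
    out_dx = [[0.0] * w for _ in range(h)]
    out_dy = [[0.0] * w for _ in range(h)]
    for j in range(h):
        for i in range(w):
            # Follow the first flow, then read the second flow where we land.
            x, y = i + dx1[j][i], j + dy1[j][i]
            out_dx[j][i] = dx1[j][i] + sample(dx2, x, y)
            out_dy[j][i] = dy1[j][i] + sample(dy2, x, y)
    return out_dx, out_dy


def concatenate_up(flows):
    """Fold C_01, C_12, ... into C_0n, extending the running flow each step."""
    total = flows[0]
    for nxt in flows[1:]:
        total = compose(total, nxt)
    return total


zeros = [[0, 0, 0, 0, 0]]
c01 = ([[1, 1, 1, 1, 1]], zeros)       # shift right by one pixel everywhere
c12 = ([[0, 10, 20, 30, 40]], zeros)   # motion that grows with position
dx, dy = concatenate_up([c01, c12])
print(dx)  # [[11.0, 21.0, 31.0, 41.0, 41.0]]
```

Adding the two fields pixel by pixel would instead give 1, 11, 21, 31,
41, which ignores that the second flow is defined relative to the
second image.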